
Implement double config for AdaptiveCrawler #1683

Closed
Vaccarini-Lorenzo wants to merge 2 commits into unclecode:develop from Vaccarini-Lorenzo:main

Conversation

@Vaccarini-Lorenzo

Summary

Proposed solution to fix Issue #1682

List of files changed and why

File impacted: adaptive_crawler.py

AdaptiveConfig now supports two configs, one for embeddings and one for chat completion API

    config = AdaptiveConfig(
        strategy=strategy,
        max_pages=20,
        top_k_links=3,
        min_gain_threshold=0.05,
        embedding_llm_config=LLMConfig(
            provider='azure/text-embedding-3-small'
        ),
        # For query generation - use LLM models
        query_llm_config=LLMConfig(
            provider='azure/gpt-4.1'
        )
    )

The configs support provider, base_url and api_token.
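For instance, a fully specified pair of configs might look like the following (the endpoint and key values are placeholders for illustration, not values from this PR):

```python
from crawl4ai import AdaptiveConfig, LLMConfig

config = AdaptiveConfig(
    strategy='embedding',
    # Placeholder endpoint/key values -- substitute your own deployment details.
    embedding_llm_config=LLMConfig(
        provider='azure/text-embedding-3-small',
        base_url='https://<your-resource>.openai.azure.com',
        api_token='<your-azure-api-key>',
    ),
    query_llm_config=LLMConfig(
        provider='azure/gpt-4.1',
        base_url='https://<your-resource>.openai.azure.com',
        api_token='<your-azure-api-key>',
    ),
)
```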

How Has This Been Tested?

This is not a breaking change; it just adds an additional config option.
In relation to Issue #1682, the new proposed approach would be:

"""
Comparison: Embedding vs Statistical Strategy

This example demonstrates the differences between statistical and embedding
strategies for adaptive crawling, showing when to use each approach.
"""

import asyncio
import time
import os
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig, AsyncLogger, LLMConfig
from crawl4ai.async_logger import LogLevel
import litellm

litellm._turn_on_debug()

logger = AsyncLogger(verbose=False, log_level=LogLevel.ERROR)

async def crawl_with_strategy(url: str, query: str, strategy: str):

    """Helper function to crawl with a specific strategy"""

    config = AdaptiveConfig(
        strategy=strategy,
        max_pages=20,
        top_k_links=3,
        min_gain_threshold=0.05,
        embedding_llm_config=LLMConfig(
            provider='azure/text-embedding-3-small',
            api_token='',
        ),
        # For query generation - use LLM models
        query_llm_config=LLMConfig(
            provider='azure/gpt-4.1',
            api_token='',
        )
    )
    
    async with AsyncWebCrawler(verbose=False, logger=logger) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        
        start_time = time.time()
        result = await adaptive.digest(start_url=url, query=query)
        elapsed = time.time() - start_time
        
        return {
            'result': result,
            'crawler': adaptive,
            'elapsed': elapsed,
            'pages': len(result.crawled_urls),
            'confidence': adaptive.confidence
        }


async def main():
    """Compare embedding and statistical strategies"""
    
    # Test scenarios
    test_cases = [
        {
            'name': 'Technical Documentation (Specific Terms)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'asyncio.create_task event_loop.run_until_complete'
        },
        {
            'name': 'Conceptual Query (Semantic Understanding)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'concurrent programming patterns'
        },
        {
            'name': 'Ambiguous Query',
            'url': 'https://realpython.com',
            'query': 'python performance optimization'
        }
    ]

    
    for test in test_cases:
        print("\n" + "="*70)
        print(f"TEST: {test['name']}")
        print(f"URL: {test['url']}")
        print(f"Query: '{test['query']}'")
        print("="*70)
        
        # Run statistical strategy (needed below: the comparison section uses stat_result)
        print("\n📊 Statistical Strategy:")
        stat_result = await crawl_with_strategy(
            test['url'], 
            test['query'], 
            'statistical'
        )
        
        print(f"  Pages crawled: {stat_result['pages']}")
        print(f"  Time taken: {stat_result['elapsed']:.2f}s")
        print(f"  Confidence: {stat_result['confidence']:.1%}")
        print(f"  Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")
        
        # Show term coverage
        if hasattr(stat_result['result'], 'term_frequencies'):
            query_terms = test['query'].lower().split()
            covered = sum(1 for term in query_terms 
                         if term in stat_result['result'].term_frequencies)
            print(f"  Term coverage: {covered}/{len(query_terms)} query terms found")
        
        # Run embedding strategy
        print("\n🧠 Embedding Strategy:")
        emb_result = await crawl_with_strategy(
            test['url'], 
            test['query'], 
            'embedding'
        )
        
        print(f"  Pages crawled: {emb_result['pages']}")
        print(f"  Time taken: {emb_result['elapsed']:.2f}s")
        print(f"  Confidence: {emb_result['confidence']:.1%}")
        print(f"  Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")
        
        # Show semantic understanding
        if emb_result['result'].expanded_queries:
            print(f"  Query variations: {len(emb_result['result'].expanded_queries)}")
            print(f"  Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
        
        # Compare results
        print("\n📈 Comparison:")
        efficiency_diff = ((stat_result['pages'] - emb_result['pages']) / 
                          stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0
        
        print(f"  Efficiency: ", end="")
        if efficiency_diff > 0:
            print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
        else:
            print(f"Statistical used {-efficiency_diff:.0f}% fewer pages")
        
        print(f"  Speed: ", end="")
        if stat_result['elapsed'] < emb_result['elapsed']:
            print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
        else:
            print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")
        
        print(f"  Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
        
        # Recommendation
        print("\n💡 Recommendation:")
        if 'specific' in test['name'].lower() or all(len(term) > 5 for term in test['query'].split()):
            print("  → Statistical strategy is likely better for this use case (specific terms)")
        elif 'conceptual' in test['name'].lower() or 'semantic' in test['name'].lower():
            print("  → Embedding strategy is likely better for this use case (semantic understanding)")
        else:
            if emb_result['confidence'] > stat_result['confidence'] + 0.1:
                print("  → Embedding strategy achieved significantly better understanding")
            elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
                print("  → Statistical strategy is much faster with similar results")
            else:
                print("  → Both strategies performed similarly; choose based on your priorities")
    
    # Summary recommendations
    print("\n" + "="*70)
    print("STRATEGY SELECTION GUIDE")
    print("="*70)
    print("\n✅ Use STATISTICAL strategy when:")
    print("  - Queries contain specific technical terms")
    print("  - Speed is critical")
    print("  - No API access available")
    print("  - Working with well-structured documentation")
    
    print("\n✅ Use EMBEDDING strategy when:")
    print("  - Queries are conceptual or ambiguous")
    print("  - Semantic understanding is important")
    print("  - Need to detect irrelevant content")
    print("  - Working with diverse content sources")


if __name__ == "__main__":
    asyncio.run(main())

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ntohidi
Collaborator

ntohidi commented Feb 25, 2026

@Vaccarini-Lorenzo
Thanks for the PR, separating embedding_llm_config and query_llm_config is the right approach to fix this issue. However, there are a few issues that need to be addressed before we can merge:

Runtime bugs:

  1. _embedding_llm_config_dict was renamed to _llm_config_dict on AdaptiveConfig, but _get_embedding_llm_config_dict() in EmbeddingStrategy still references self.config._embedding_llm_config_dict (the old name). This will raise AttributeError.
  2. _get_query_llm_config_dict() references self.config._query_llm_config_dict, but this property was never added to AdaptiveConfig. Same AttributeError.

Behavioral change:

  1. The old _get_embedding_llm_config_dict() returned None when no config was set, which made get_text_embeddings() use local sentence-transformers (no API key needed). The new fallback defaults to openai/text-embedding-3-small, which would break existing users who don't have an OpenAI API key. The fallback should remain None to preserve the local embedding behavior.

Minor:

  1. Line 182: query_llm_config comment says "Separate config for embeddings" — should say "for query generation".
  2. Lines 618-622: the leftover commented-out code block should be removed.

Could you take a look at these? The main fixes needed are adding the missing _query_llm_config_dict property on AdaptiveConfig, fixing the renamed property reference, and preserving the None fallback for local embeddings.
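As a rough sketch of those fixes, using stand-in classes rather than the real crawl4ai ones (the `to_dict()` shape here is assumed for illustration; the real `LLMConfig.to_dict()` also carries backoff fields):

```python
from typing import Optional

class LLMConfig:
    """Stand-in for crawl4ai's LLMConfig (illustration only)."""
    def __init__(self, provider: str, base_url: Optional[str] = None,
                 api_token: Optional[str] = None):
        self.provider = provider
        self.base_url = base_url
        self.api_token = api_token

    def to_dict(self) -> dict:
        # Assumed shape for this sketch.
        return {'provider': self.provider,
                'base_url': self.base_url,
                'api_token': self.api_token}

class AdaptiveConfig:
    """Stand-in showing the two properties the review asks for."""
    def __init__(self, embedding_llm_config: Optional[LLMConfig] = None,
                 query_llm_config: Optional[LLMConfig] = None):
        self.embedding_llm_config = embedding_llm_config
        self.query_llm_config = query_llm_config

    @property
    def _embedding_llm_config_dict(self) -> Optional[dict]:
        # Fallback stays None so local sentence-transformers keep working
        # when no embedding config is supplied.
        if self.embedding_llm_config is None:
            return None
        return self.embedding_llm_config.to_dict()

    @property
    def _query_llm_config_dict(self) -> Optional[dict]:
        # The property the review found missing, mirroring the embedding one.
        if self.query_llm_config is None:
            return None
        return self.query_llm_config.to_dict()
```

With no configs set, both properties return None, which is what preserves the local-embedding behavior flagged in the review.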

unclecode added a commit that referenced this pull request Feb 25, 2026
…ion (#1682)

The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @sthakrar — thank you for identifying the
issue and proposing the initial approach.
@unclecode
Owner

Hey @Vaccarini-Lorenzo - thank you for filing #1682 and this PR! You identified a real design gap and the query_llm_config approach you proposed was exactly the right solution.

We went ahead and landed this in develop (a4cc0a9) with a clean-room implementation that addresses the issues @ntohidi flagged (missing properties, broken fallback chain, behavioral change on local embeddings). Specifically:

  • Added query_llm_config field on AdaptiveConfig + _query_llm_config_dict property
  • Added _get_query_llm_config_dict() on EmbeddingStrategy with a proper fallback chain: explicit query_llm_config -> AdaptiveConfig -> legacy llm_config -> None (preserves local embedding behavior)
  • Simplified _embedding_llm_config_dict to use LLMConfig.to_dict() (fixes missing backoff params)
  • Fixed base_url and backoff params not being passed to perform_completion_with_backoff in query expansion
  • Added e2e tests and updated docs
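The fallback chain for query generation could be sketched like this (a simplified stand-alone version; the real logic lives in `_get_query_llm_config_dict()` on EmbeddingStrategy):

```python
from types import SimpleNamespace

def get_query_llm_config_dict(config):
    """Sketch of the fallback chain:
    explicit query_llm_config -> legacy llm_config -> None."""
    query_cfg = getattr(config, 'query_llm_config', None)
    if query_cfg is not None:
        return dict(query_cfg)
    legacy_cfg = getattr(config, 'llm_config', None)
    if legacy_cfg is not None:
        return dict(legacy_cfg)
    # None lets the strategy keep its hardcoded defaults / local embeddings.
    return None

# Explicit query config wins over the legacy one.
cfg = SimpleNamespace(query_llm_config={'provider': 'azure/gpt-4.1'},
                      llm_config={'provider': 'openai/gpt-4o-mini'})
print(get_query_llm_config_dict(cfg))  # {'provider': 'azure/gpt-4.1'}

# Nothing set: fall through to None.
print(get_query_llm_config_dict(SimpleNamespace()))  # None
```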

The target API is exactly what you proposed:

AdaptiveConfig(
    embedding_llm_config=LLMConfig(provider='openai/text-embedding-3-small'),
    query_llm_config=LLMConfig(provider='openai/gpt-4o-mini'),
)

Since the fix is already on develop, we'll close this PR - but your contribution was instrumental in getting this done. The commit credits you for the original idea. We'd love to see more contributions from you!

Closing in favor of a4cc0a9. Also closing #1682 as fixed.

@unclecode unclecode closed this Feb 25, 2026
@Vaccarini-Lorenzo
Author

Hi @unclecode
Thank you so much for making my contribution possible, this project is just amazing!

P.s.
I think that you made a typo in commit a4cc0a9

Inspired by PR https://github.com/unclecode/crawl4ai/pull/1683 from ~~@sthakrar~~ @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.

@unclecode
Owner

@Vaccarini-Lorenzo Good catch on the typo - that's embarrassing! I'll fix the commit message to properly credit you instead of @sthakrar. Apologies for the mixup.

And thanks again for the contribution - the separate query_llm_config / embedding_llm_config design you proposed was clean and exactly what we needed. Glad to have it in the codebase.

By the way - we're building out Crawl4AI Cloud and starting paid collaborations with contributors who know the system well. If that's something you'd be interested in, send an email to aravind@crawl4ai.com (cc: unclecode@crawl4ai.com) and we can chat.

unclecode added a commit that referenced this pull request Feb 27, 2026
…ion (#1682)

The embedding strategy uses two incompatible API call types: embedding
calls (text-to-vector) and query expansion (chat completion). Previously
both used a single embedding_llm_config, so setting an embedding model
broke query expansion and vice versa.

Add query_llm_config to AdaptiveConfig and EmbeddingStrategy so users
can specify separate models for each call type. Fallback chain preserves
backward compatibility: query_llm_config -> llm_config -> hardcoded defaults.

Also fixes base_url and backoff params not being passed to
perform_completion_with_backoff in query expansion, and simplifies
_embedding_llm_config_dict to use LLMConfig.to_dict() (which includes
the 3 backoff fields the manual extraction was missing).

Inspired by PR #1683 from @Vaccarini-Lorenzo — thank you for identifying the
issue and proposing the initial approach.
@unclecode
Owner

@Vaccarini-Lorenzo Fixed! The commit message now correctly credits you. Thanks for flagging it.
